import os
from datetime import datetime
import time
from tqdm import tqdm
import pandas as pd
import spacy
import re
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance, KeyBERTInspired
from sentence_transformers import SentenceTransformer
# from umap import UMAP
from cuml import UMAP
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# from hdbscan import HDBSCAN
from cuml.cluster.hdbscan import HDBSCAN
import plotly.io as pio
pio.renderers.default = "notebook+vscode+jupyterlab"
sns.set_theme(style="darkgrid")
# %config InlineBackend.figure_format = "retina"
# Dictionaries:
# en_core_web_sm
# en_core_web_md
# en_core_web_lg
# en_core_web_trf
nlp = spacy.load(
"en_core_web_sm",
exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"],
)
spacy_stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)

Dynamic Topic Modelling of r/politics subreddit
This project allows you to:
- Gather data from Reddit and save it in CSV format
- Clean and explore the gathered data
- Extract the main topics from the data
- Visualise how topics change dynamically over time
To extract topics, we use the BERTopic library, which performs topic modelling by clustering vector representations of documents. The main differences between BERTopic and other topic models are:
- High speed, achieved by reducing the dimensionality of the vector representations.
- A modular pipeline: the vectorization, dimensionality-reduction and clustering stages are separated from each other, which makes it easy to experiment with different combinations of algorithms and settings.
- The pipeline is built from SOTA tools (SBERT, UMAP, HDBSCAN), which together produce better results than other models.
This project can easily be adapted to other sources of information, enabling a wide range of experiments.
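To illustrate the modularity claim with a hedged, self-contained sketch (not part of the project's pipeline): any two reducers exposing the same `fit_transform` interface are interchangeable, which is the property that lets BERTopic's dimensionality-reduction stage be swapped out.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Toy stand-ins: 100 fake 384-dimensional "embeddings"
X = np.random.RandomState(0).rand(100, 384)

# Both reducers expose the same fit_transform interface, so a pipeline
# can swap one for the other without any other change -- the same
# property BERTopic exploits for its dimensionality-reduction stage.
for reducer in (PCA(n_components=5), TruncatedSVD(n_components=5)):
    reduced = reducer.fit_transform(X)
    print(type(reducer).__name__, reduced.shape)
```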
Install libraries
Load and clean data from CSVs
BERTopic uses Transformers. The model learns better when it receives more information from the text, so preprocessing is kept minimal.
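As a small illustration of this minimal cleaning (using a simplified URL pattern for brevity, not the project's full one):

```python
import re

comment = "Read this  https://example.com/post now"
# Strip the URL, then collapse the whitespace it leaves behind;
# everything else in the text is kept intact for the model.
cleaned = re.sub(r"https?://\S+", " ", comment)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)  # → Read this now
```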
Function to clean data from HTML elements using regular expressions
def regex_preprocessing(text):
    # Remove URLs
    text = re.sub(
        r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
        " ",
        text,
    )
    text = re.sub(
        r"\(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\)",
        " ",
        text,
    )
    # Remove special symbols
    text = re.sub(r"\n|\r|\<.*?\>|\{.*?\}|u/|\(.*emote.*\)|\[gif\]|/s|_", " ", text)
    text = re.sub(r"[^\w0-9'’“”%!?.,-:*()><]", " ", text)
    # Remove unnecessary brackets
    text = re.sub(r"\s\(\s", " ", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()

Function to convert data to a dataframe, drop duplicates in the dataframe and apply the ‘regex_preprocessing’ function to the data
def data_preprocessing(file_name):
    data = pd.read_csv(file_name)
    # copy() avoids pandas' SettingWithCopyWarning on the next assignment
    data_cleaned = data.drop_duplicates(keep=False).copy()
    data_cleaned["comments"] = data_cleaned["comments"].apply(regex_preprocessing)
    return data_cleaned

Function to create a dataframe with cleaned data
This function consists of several steps:
- First, it gets the names of the CSV files in a chosen folder
- Second, it applies the ‘data_preprocessing’ function to the CSVs to create dataframes with cleaned data
- Last, it combines them into a single dataframe of cleaned data
def process_data(directory):
    file_names = []
    for filename in os.listdir(directory):
        file = os.path.join(directory, filename)
        file_names.append(file)
    file_names.sort()
    dataframes = []
    for name in file_names:
        dataframes.append(data_preprocessing(name))
    cleaned_df = (
        pd.concat(dataframes)
        .drop(columns="time")
        .reset_index(drop=True)
        .drop_duplicates()
        .dropna()
    )
    return cleaned_df

Apply data processing functions to gathered data
For this experiment, we load CSVs with data marked as ‘hot’ by Reddit's algorithms.
directory = "original_data/hot"
combined_df = process_data(directory)
len(combined_df["comments"].to_list())
264230
Convert the dataframe to lists for further work
comments = combined_df["comments"].to_list()
timestamps = combined_df["date"].to_list()

Create embeddings from cleaned data
The gte-small model was chosen using the Hugging Face benchmark. It is lightweight and works well with Reddit data.
# Pre-calculate embeddings
embedding_model = SentenceTransformer(
model_name_or_path="thenlper/gte-small",
cache_folder="transformers_cache",
)
embeddings = embedding_model.encode(comments, show_progress_bar=True)

Plot data distribution
We use UMAP to reduce the dimensionality of the embeddings, which makes them easier to cluster with HDBSCAN.
def plot_umap(embeddings, values):
    neighbors_list = values
    fig, axes = plt.subplots(2, 5, figsize=(27, 10), sharex=True, sharey=True)
    axes = axes.flatten()
    for ax, neighbors in tqdm(zip(axes, neighbors_list)):
        umap_model = UMAP(
            n_neighbors=neighbors, n_components=2, min_dist=0.0, metric="cosine"
        )
        # Apply UMAP to our data
        umap_result = umap_model.fit_transform(embeddings)
        # Visualise the results
        ax.scatter(
            umap_result[:, 0], umap_result[:, 1], alpha=0.15, c="orangered", s=0.1
        )
        ax.set_title(f"UMAP, n_neighbors = {neighbors}")
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")
    lim = 7
    plt.ylim(-lim, lim)
    plt.xlim(-lim, lim)
    plt.tight_layout()
    plt.show()
def plot_hdbscan(embeddings, umap_values, hdbscan_values):
    for n in umap_values:
        # Apply UMAP to our data
        umap_model = UMAP(n_neighbors=n, n_components=2, min_dist=0.0, metric="cosine")
        umap_result = umap_model.fit_transform(embeddings)
        # HDBSCAN
        sizes = hdbscan_values
        fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharex=True, sharey=True)
        axes = axes.flatten()
        for ax, size in tqdm(zip(axes, sizes)):
            # Cluster data with HDBSCAN
            hdbscan_model = HDBSCAN(
                min_cluster_size=size, metric="euclidean", prediction_data=True
            )
            hdbscan_labels = hdbscan_model.fit_predict(umap_result)
            # Create a dataframe with the UMAP and HDBSCAN results
            df = pd.DataFrame(umap_result, columns=[f"UMAP{i + 1}" for i in range(2)])
            df["Cluster"] = hdbscan_labels
            # Scatterplot of the results
            sns.scatterplot(
                x="UMAP1",
                y="UMAP2",
                hue="Cluster",
                data=df,
                palette="tab10",
                legend=None,
                linewidth=0,
                s=0.5,
                ax=ax,
            ).set_title(f"n_neighbors={n}, min_cluster_size={size}")
            ax.set_xlabel("Component 1")
            ax.set_ylabel("Component 2")
        lim = 7
        plt.ylim(-lim, lim)
        plt.xlim(-lim, lim)
        plt.tight_layout()
        plt.show()

We plot a range of values to see how the structure of the data changes, from a more local structure to a global one.
plot_umap(embeddings, np.arange(10, 56, 5))
10it [01:24, 8.42s/it]

We can see the sizes of the clusters created with different parameter combinations.
plot_hdbscan(embeddings, [15, 20, 25], [15, 35, 50, 75])
4it [02:08, 32.09s/it]

4it [02:28, 37.09s/it]

4it [02:16, 34.07s/it]

Extract topics using BERTopic
In this work, we use the MaximalMarginalRelevance topic representation model, which reorders the words within each topic to remove semantic repetition and produce a sequence of the most significant words.
We use CountVectorizer from scikit-learn to:
- remove very rare and very frequent words from the final topic representations
- create n-grams of up to 2 words
- remove stopwords using the spaCy stopword list
Function for Topic Modelling Pipeline
def topic_modelling(n_neighbors, min_cluster_size):
    # UMAP init
    umap_model = UMAP(
        n_neighbors=n_neighbors, n_components=5, min_dist=0.0, metric="cosine"
    )
    # HDBSCAN init
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size, metric="euclidean", prediction_data=True
    )
    # Remove noise from created topics
    vectorizer_model = CountVectorizer(
        stop_words=spacy_stopwords, min_df=0.03, max_df=0.99, ngram_range=(1, 2)
    )
    # BERTopic model init
    representation_model = MaximalMarginalRelevance()
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
        verbose=True,
    )
    # Fit the model
    topics, probs = topic_model.fit_transform(comments, embeddings)
    # Get topics over time
    topics_over_time = topic_model.topics_over_time(
        comments,
        timestamps,
        datetime_format="%Y_%m_%d",
        global_tuning=True,
        evolution_tuning=True,
        nr_bins=20,
    )
    # Plot topics over time
    plot = topic_model.visualize_topics_over_time(
        topics_over_time, top_n_topics=15, height=700, width=1200
    )
    return topics, probs, topics_over_time, plot

Experiments
This part is purely experimental and requires a lot of time to tune the model's hyperparameters to get the best output. This is one of the main problems of topic modelling: there is no metric to help us choose the best hyperparameters, and the best modelling result may be subjective. That is why we run a series of experiments to obtain several results.
Generally, hyperparameters should be chosen with several goals in mind:
- To preserve the local structure of the data after reducing its dimensionality with UMAP.
- To reduce the amount of noise in the clusters and create a reasonable number of topics with HDBSCAN.
- To produce a list of understandable topics as output.
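The runs below sweep a small grid over the two key hyperparameters. The sweep can be sketched with itertools (the grid values match the runs below; the loop itself is illustrative):

```python
from itertools import product

# (n_neighbors, min_cluster_size) pairs covered by the experiments below
param_grid = list(product([15, 25], [35, 50, 75]))
for n_neighbors, min_cluster_size in param_grid:
    # Each pair would be passed to topic_modelling(n_neighbors, min_cluster_size)
    print(n_neighbors, min_cluster_size)
```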
UMAP n_neighbors = 15, HDBSCAN min_cluster_size = 35
topics, probs, topics_over_time, plot = topic_modelling(15, 35)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:43:52,525 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:44:13,632 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:44:13,635 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:44:00.052435] Transform can only be run with brute force. Using brute force.
2024-12-08 12:45:02,962 - BERTopic - Cluster - Completed ✓
2024-12-08 12:45:03,007 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:45:26,080 - BERTopic - Representation - Completed ✓
16it [03:11, 11.95s/it]
UMAP n_neighbors = 15, HDBSCAN min_cluster_size = 50
topics, probs, topics_over_time, plot = topic_modelling(15, 50)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:48:42,689 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:49:05,180 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:49:05,184 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:48:50.374168] Transform can only be run with brute force. Using brute force.
2024-12-08 12:49:53,314 - BERTopic - Cluster - Completed ✓
2024-12-08 12:49:53,349 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:50:11,762 - BERTopic - Representation - Completed ✓
16it [02:23, 8.97s/it]
UMAP n_neighbors = 15, HDBSCAN min_cluster_size = 75
topics, probs, topics_over_time, plot = topic_modelling(15, 75)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:52:40,079 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:53:02,998 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:53:03,002 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:52:47.786826] Transform can only be run with brute force. Using brute force.
2024-12-08 12:53:53,905 - BERTopic - Cluster - Completed ✓
2024-12-08 12:53:53,940 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:54:10,500 - BERTopic - Representation - Completed ✓
16it [01:41, 6.33s/it]
UMAP n_neighbors = 25, HDBSCAN min_cluster_size = 35
topics, probs, topics_over_time, plot = topic_modelling(25, 35)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:55:56,336 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:56:20,594 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:56:20,597 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:56:05.075884] Transform can only be run with brute force. Using brute force.
2024-12-08 12:57:10,667 - BERTopic - Cluster - Completed ✓
2024-12-08 12:57:10,701 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:57:32,180 - BERTopic - Representation - Completed ✓
16it [02:35, 9.71s/it]
UMAP n_neighbors = 25, HDBSCAN min_cluster_size = 50
topics, probs, topics_over_time, plot = topic_modelling(25, 50)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 13:00:12,350 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 13:00:33,530 - BERTopic - Dimensionality - Completed ✓
2024-12-08 13:00:33,534 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [13:00:20.726477] Transform can only be run with brute force. Using brute force.
2024-12-08 13:01:21,580 - BERTopic - Cluster - Completed ✓
2024-12-08 13:01:21,614 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 13:01:37,926 - BERTopic - Representation - Completed ✓
16it [01:56, 7.27s/it]
UMAP n_neighbors = 25, HDBSCAN min_cluster_size = 75
topics, probs, topics_over_time, plot = topic_modelling(25, 75)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 13:03:38,968 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 13:04:02,000 - BERTopic - Dimensionality - Completed ✓
2024-12-08 13:04:02,004 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [13:03:47.728912] Transform can only be run with brute force. Using brute force.
2024-12-08 13:04:50,250 - BERTopic - Cluster - Completed ✓
2024-12-08 13:04:50,287 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 13:05:05,903 - BERTopic - Representation - Completed ✓
16it [01:31, 5.74s/it]